Data Visualization

Carlina Feldmann
Lennart Oelschläger

Version of 20.03.2023

Why and what

Welcome to this tiny course on data visualization in R with {ggplot2}! 👋

Why do we care?

Plots can beautifully inform or horribly mislead. Colors and shapes matter! ⚖️

Why {ggplot2}?

The {ggplot2} package implements a grammar of graphics, a series of distinct tasks to make a graphic.

What is this course about?

Being in decent control of {ggplot2} to produce meaningful plots.

What do you need?

Basic R skills + a not-too-old version of R (>= 2.10) + RStudio

At the end of the day…

Course material

Executing the following lines in R gives you access to the course material:

install.packages("remotes")
remotes::install_github("loelschlaeger/rcourse", upgrade = "never")
library("rcourse")

To open a copy of these slides, type:

slides()

To start the practicals, type:

practicals()

Sources

Found mistakes? Have suggestions?

You can leave a note here on GitHub. 🙏

Our first plot

First we get {ggplot2}.

# install.packages("ggplot2")
library(ggplot2)

Next we need data. Let’s go with an excerpt from the famous Gapminder dataset:

# install.packages("gapminder")
library(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Now we tell the ggplot() function what data we use and what variables we wish to see on each axis:

ggplot(
  data = gapminder, 
  mapping = aes(x = gdpPercap, y = lifeExp)
) 

Something is missing … 🤔 We need an additional layer, a geom_* function!

ggplot(
  data = gapminder, 
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point()

There are more geoms, and we can simply add them (literally add, with +):

p <- ggplot(
  data = gapminder, 
  mapping = aes(x = gdpPercap, y = lifeExp)
)
p <- p + geom_point() + geom_smooth()
p

As a last polishing step for now, we improve the x-axis scale and the plot labels.

p + scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP per capita",
       y = "Life expectancy in years",
       title = "Economic growth as an indicator for life expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder")

Finally, we can use the ggsave() function to save our plot:

ggsave("some_descriptive_name.pdf")
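
By default, ggsave() saves the plot that was displayed last. A minimal sketch of a more explicit call, assuming we want an 8 x 5 inch PDF in a temporary directory. The file name and dimensions are made up for illustration, and the built-in mpg data stands in for gapminder to keep the example self-contained:

```r
library(ggplot2)

# Build a plot object explicitly so we control what gets saved.
p_demo <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path; replace with your own file name.
out_file <- file.path(tempdir(), "displ_vs_hwy.pdf")

# plot = ... names the plot to save; the file type follows from the extension.
ggsave(filename = out_file, plot = p_demo, width = 8, height = 5, units = "in")
```

Passing plot = ... avoids accidentally saving a different plot that happened to be drawn in between.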

Summary of the {ggplot2} workflow

  1. Call ggplot()
  2. Set data = ...
  3. Set mapping = aes(...)
  4. Add one (or more) geom_*() functions
  5. Adjust the scale and labels
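
The five steps above can be sketched in one annotated call. This is only an illustration: it assumes the mpg dataset that ships with {ggplot2}, and the axis labels are our own wording.

```r
library(ggplot2)

p_workflow <- ggplot(                        # 1. call ggplot()
  data = mpg,                                # 2. set the data
  mapping = aes(x = displ, y = hwy)          # 3. map variables to aesthetics
) +
  geom_point() +                             # 4. add one (or more) geoms
  scale_x_continuous(breaks = 2:7) +         # 5. adjust the scale ...
  labs(x = "Engine displacement in litres",  #    ... and the labels
       y = "Highway miles per gallon")
p_workflow
```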

Facets and more geoms

Our goal is to plot the trajectory of life expectancy over time for each country in the gapminder data.

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line()

This looks odd, we forgot to group by country! 💡

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country))
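
To see what the group aesthetic does, here is a small sketch with made-up data (two hypothetical countries "A" and "B", three years each). It compares the groups that {ggplot2} computes with and without group = country:

```r
library(ggplot2)

# Toy data: two made-up countries observed in three years.
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(2000, 2005, 2010), times = 2),
  lifeExp = c(60, 62, 64, 70, 71, 72)
)

# Without a group, all six rows form a single line that jumps between A and B.
p_one_line <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line()

# With group = country, each country gets its own line.
p_two_lines <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country))

# layer_data() exposes the group ids that ggplot2 computed for the layer.
n_groups_without <- length(unique(layer_data(p_one_line)$group))
n_groups_with    <- length(unique(layer_data(p_two_lines)$group))
```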

But can you make sense of this mess? Luckily, we can additionally split the plot into one panel per continent:

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country)) +
  facet_wrap(~continent)

Better not to facet_wrap(~country), that would give 142 panels… 🛑 Let’s polish our plot with the things we already learned:

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(color = "grey", aes(group = country)) +
  geom_smooth() +
  facet_wrap(~continent) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time on five continents")

Notice that we supplied a formula to facet_wrap(). Formulas can be more advanced, for example a two-sided one with facet_grid():

ggplot(data = socviz::gss_sm, mapping = aes(x = age, y = childs)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(sex ~ race) +
  labs(x = "Age",
       y = "No. of children",
       title = "Relationship between age and number of children",
       subtitle = "Separated by sex (in rows) and race (in columns)")
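
How to read the two-sided formula: the variable left of ~ defines the rows, the variable right of ~ defines the columns. A minimal sketch, assuming the mpg dataset that ships with {ggplot2} so that no extra package is needed:

```r
library(ggplot2)

p_grid <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  # Left of ~: variable for the rows; right of ~: variable for the columns.
  facet_grid(drv ~ cyl)
p_grid

# Number of panels: every combination of drv and cyl gets one, even empty ones.
n_panels <- nrow(ggplot_build(p_grid)$layout$layout)
```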

As a last input for this part, we learn four new geoms:

Bar plots

ggplot(data = socviz::gss_sm, mapping = aes(x = religion)) +
  geom_bar()

Using relative instead of absolute counts on the y-axis is covered in the tutorials.
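
As a small preview, one possible way to get relative counts is sketched here. It assumes the mpg dataset from {ggplot2}; the tutorials may take a different route.

```r
library(ggplot2)

p_prop <- ggplot(data = mpg, mapping = aes(x = class)) +
  # after_stat(prop) uses the proportion computed by the stat instead of the count;
  # group = 1 makes the proportions relative to all rows, not within each bar.
  geom_bar(aes(y = after_stat(prop), group = 1)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Vehicle class", y = "Share of cars")
p_prop
```

Without group = 1, every bar would form its own group and show a proportion of 100%.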

Histograms

ggplot(data = socviz::gss_sm, mapping = aes(x = age)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).

There is a message and a warning. We will address both in the practicals.
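
As a preview, a minimal sketch of one way to handle both, assuming the airquality dataset that ships with R (its Ozone column contains missing values). The bin width of 10 is an arbitrary choice for illustration:

```r
library(ggplot2)

p_hist <- ggplot(data = airquality, mapping = aes(x = Ozone)) +
  geom_histogram(
    binwidth = 10,  # explicit bin width, so no message about bins = 30
    na.rm = TRUE    # drop missing values silently instead of with a warning
  )
p_hist
```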

Density plots

library(dplyr)
ggplot(data = filter(gapminder, year == 2007), 
       mapping = aes(x = lifeExp)) +
  geom_density()

Boxplots

ggplot(data = filter(gapminder, year == 2007), 
       mapping = aes(x = pop,
                     y = reorder(continent, pop))) +
  geom_boxplot() +
  scale_x_log10() + 
  labs(y = NULL,
       x = "Populations in 2007")

In the tutorials, we look at a variant of the basic boxplot that {ggplot2} offers.

Draw Maps

R can work with geographical data, and {ggplot2} can produce choropleth maps.

# map_data() needs the {maps} package to be installed
world <- map_data("world")
p <- ggplot(data = world, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
p

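The plot above uses plain Cartesian coordinates for longitude and latitude. coord_map() applies a proper map projection instead (it needs the {mapproj} package). Its default is the Mercator projection; here we use the Albers projection: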

p + coord_map(projection = "albers", lat0 = 15, lat1 = 45)
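
The map above only draws outlines. A choropleth additionally fills each region according to a value. A minimal sketch, assuming the {maps} package is installed and using the USArrests dataset that ships with R as a stand-in for real data of interest:

```r
library(ggplot2)
library(dplyr)

# Requires the {maps} package to be installed for map_data().
states <- map_data("state")

# USArrests ships with R; its row names are the state names.
arrests <- data.frame(region = tolower(rownames(USArrests)), USArrests)

# left_join() keeps the row order of `states`, so the polygons stay intact.
states_arrests <- left_join(states, arrests, by = "region")

ggplot(data = states_arrests,
       mapping = aes(x = long, y = lat, group = group, fill = Murder)) +
  geom_polygon(color = "white") +
  labs(fill = "Murder arrests\nper 100,000", x = NULL, y = NULL)
```

Regions without a match in the joined data get a missing fill value and appear in grey.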

In the tutorials, we will visualize the results of the 2016 Trump vs. Clinton election on a map of the US states.

Challenge

Reproduce this plot! 😎

Don’t forget to install and load the packages {ggplot2} and {dplyr} and load the gapminder dataset. If you want to see some hints, scroll down this page.















Hint 1: Use your {dplyr} knowledge to create an extract of the gapminder dataset that only contains values from 2007.




Hint 2: Have a look at the 3rd slide of this presentation to copy the basic syntax and remember how to modify the labels.




Hint 3: You can set the size and colour of the points to depend on certain variables in the aesthetics aes().




Hint 4: Have a look at ?guides to modify the legends.

Animations

{ggplot2} itself does not allow for interactive or animated visualizations. However, there are (of course) packages to achieve this, e.g. {plotly}, {gganimate}, {shiny}.

# Base plot: point size shows population, colour shows continent.
p_bubble <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, colour = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::dollar) +
  # Hide the size legend and drop the title of the colour legend.
  guides(size = "none", colour = guide_legend(title = "")) +
  labs(
    x = "GDP per capita",
    y = "Life expectancy in years",
    title = "Economic growth as an indicator for life expectancy",
    caption = "Source: Gapminder"
  )

# Interactive version with {plotly}
library(plotly)
ggplotly(p_bubble)

# Animated version with {gganimate}; {gifski} renders the GIF
library(gganimate)
library(gifski)
p_bubble + transition_time(year) +
  labs(title = "Year: {frame_time}")
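
To keep an animation as a file, it can be rendered and saved explicitly. A minimal sketch, assuming {gganimate} and {gifski} are installed. The frame count, frame rate, and file name are illustrative choices, not recommendations:

```r
library(ggplot2)
library(gapminder)
library(gganimate)

anim <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  transition_time(year) +
  labs(title = "Year: {frame_time}")

# Render a short animation; nframes and fps are illustrative values.
rendered <- animate(anim, nframes = 24, fps = 8, renderer = gifski_renderer())

# Hypothetical output path; replace with your own file name.
gif_file <- file.path(tempdir(), "gapminder.gif")
anim_save(gif_file, animation = rendered)
```

Fewer frames render faster but make the transitions between years less smooth.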